Chapter 3 Exploratory Data Analysis
3.1 Start with dplyr counts and summaries in console
David Robinson first explores new data with simple counts in the console.
Here we drop the package prefix (breaking the rule I just gave you) so that we can explore the data quickly by typing bare dplyr verbs in the console.
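The console calls themselves are not shown here, so the following is only a minimal sketch of what this kind of first look might be. It assumes the data are ggplot2's built-in txhousing data set (the source named in the caption of the final plot in this chapter); the summary column names are illustrative.

```r
library(dplyr) # attach once so the verbs can be typed without the dplyr:: prefix

# Assumption: the chapter's data come from ggplot2::txhousing
df <- ggplot2::txhousing

# How many rows are there per city, and per year?
city_counts <- df %>% count(city, sort = TRUE)
year_counts <- df %>% count(year)

# Per-city summary: row count, missing sales values, and median monthly sales
city_summary <- df %>%
  group_by(city) %>%
  summarise(
    n_rows = n(),
    n_missing_sales = sum(is.na(sales)),
    median_sales = median(sales, na.rm = TRUE)
  ) %>%
  arrange(desc(median_sales))

city_counts
city_summary
```

Counting missing values at this stage is useful because it explains in advance any "removed rows" warnings that the plots produce later.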
3.2 Next plot data points
After using count(), group_by() and summarise(), plot all the data points with ggplot2::geom_point(). It almost NEVER fails to show you what’s going on and is unlikely to return errors.
This is the minimum and most reliable ggplot code to start with. Let’s look at all the values of sales for each date.
## Warning: Removed 568 rows containing missing values (geom_point).
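The chunk that produced this plot is not shown, so here is a minimal sketch of what it could look like. It assumes df is ggplot2's txhousing data with a Date column called date built from the year and month columns; that construction is an assumption, not something this chapter specifies.

```r
library(magrittr) # provides %>%

# Assumption: df is txhousing with a proper Date column built from year and month
df <- ggplot2::txhousing %>%
  dplyr::mutate(date = as.Date(sprintf("%d-%02d-01", year, month)))

# Every sales value plotted against its date, nothing else
p_points <- df %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales
  ) +
  ggplot2::geom_point()

p_points # rows with a missing sales value are dropped with a warning
```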

- Then look at sales across the values of any other dimension. Here there is one other dimension: city.
## Warning: Removed 568 rows containing missing values (geom_point).

But those points look a bit crowded. Whenever the dots overlap, replace geom_point() with geom_jitter().
We also make the dots lighter using a non-intuitive parameter called alpha, which sets their opacity.
## Warning: Removed 568 rows containing missing values (geom_point).
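The code for the two city plots is not shown either. A minimal sketch, again assuming df is ggplot2's txhousing data, with city on the x axis; the width, height and alpha values are illustrative choices rather than the author's settings.

```r
library(magrittr) # provides %>%

# Assumption: df is ggplot2's txhousing data (one row per city and month)
df <- ggplot2::txhousing

# All sales values for each city: many points sit on top of each other
p_city <- df %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = city,
    y = sales
  ) +
  ggplot2::geom_point()

# Same mapping, but jittered sideways and made semi-transparent
p_city_jitter <- df %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = city,
    y = sales
  ) +
  ggplot2::geom_jitter(
    width = 0.3, # spread the points horizontally within each city
    height = 0, # leave the sales values themselves untouched
    alpha = 0.2 # 0 is fully transparent, 1 is fully opaque
  )

p_city_jitter
```

Setting height = 0 is a deliberate choice in this sketch: by default geom_jitter() also adds a little vertical noise, which would nudge the plotted sales values.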

Of course, we know sales of most things vary by season. Let’s put date on the x axis, make city the colour, and, because the data is over time, join the dots using ggplot2::geom_line().
We’re also using the reduced data set so it’s not too crowded for now.
df_red %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales,
    colour = city
  ) +
  ggplot2::geom_line()
- Beautiful. While sales volumes differ a lot between cities, we can see they tightly follow the same seasonal pattern. But they are on different scales, so the patterns are harder to compare. One option, which Wickham uses, is to log-transform the sales values.
df_red %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = base::log(sales),
    colour = city
  ) +
  ggplot2::geom_line()
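A variation worth trying is to log the axis rather than the values, so the tick labels stay in the original sales units. The sketch below is not the author's code: df_red is not defined in this section, so a hypothetical stand-in is built here from four txhousing cities chosen only for illustration.

```r
library(magrittr) # provides %>%

# Hypothetical stand-in for df_red: a few cities picked only for illustration
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Austin", "Dallas", "Houston", "San Antonio")) %>%
  dplyr::mutate(date = as.Date(sprintf("%d-%02d-01", year, month)))

# Log-scaled y axis instead of log-transformed values
p_log <- df_red %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales,
    colour = city
  ) +
  ggplot2::geom_line() +
  ggplot2::scale_y_log10(labels = scales::comma)

p_log
```

Note that scale_y_log10() uses base 10 while base::log() is the natural log. The shape of the lines is the same either way; only the axis labelling differs.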
3.3 Why start with geom_point()?
We start with ggplot2::geom_point() because it works for both raw and summarised data straight away.
For example, here is raw granular data where each row describes a person getting married.
marriage <-
  mosaicData::Marriage %>%
  tidylog::mutate(prev_marriage = as.character(prevconc)) %>%
  tidylog::mutate(prev_marriage = dplyr::case_when(
    is.na(prev_marriage) ~ "First Time",
    TRUE ~ prev_marriage
  )) %>%
  tidylog::mutate(ceremonydate1 = lubridate::parse_date_time(ceremonydate, "mdy"))
## mutate: new variable 'prev_marriage' with 3 unique values and 49% NA
## mutate: changed 48 values (49%) of 'prev_marriage' (48 fewer NA)
## mutate: new variable 'ceremonydate1' with 49 unique values and 0% NA
kableExtra::kable(utils::head(marriage %>%
  dplyr::select(ceremonydate1, person, prev_marriage, age, race, sign)))

| ceremonydate1 | person | prev_marriage | age | race | sign |
|---|---|---|---|---|---|
| 1996-11-09 | Groom | First Time | 32.60274 | White | Aries |
| 1996-11-12 | Groom | Divorce | 32.29041 | White | Leo |
| 1996-11-27 | Groom | Divorce | 34.79178 | Hispanic | Pisces |
| 1996-12-07 | Groom | Divorce | 40.57808 | Black | Gemini |
| 1996-12-14 | Groom | First Time | 30.02192 | White | Saggitarius |
| 1996-12-26 | Groom | First Time | 26.86301 | White | Pisces |
Before, we had one value per city and date, so geom_line() worked fine as long as the date and the city were in the “aesthetics” of the plot (x, y, colour, category, or facet being the most common).
A common mistake I keep making with raw data is to try to put it into a bar or line chart straight away, and then get confused by the error or by the chart.
## Error: stat_count() must not be used with a y aesthetic.
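The chunk that triggered this error is not shown. A minimal sketch that reproduces this kind of error, assuming the mapping was person on x and age on y (an assumption), and then draws the same mapping with geom_col() for comparison:

```r
library(magrittr) # provides %>%

marriage <- mosaicData::Marriage

# geom_bar() counts rows itself, so it cannot also be given a y aesthetic
p_bar <- marriage %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = person,
    y = age
  ) +
  ggplot2::geom_bar()

# The error only appears when the plot is built (that is, printed)
bar_error <- tryCatch(
  {
    ggplot2::ggplot_build(p_bar)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
bar_error # exact wording depends on the ggplot2 version

# geom_col() accepts y, but it stacks every row, so each bar is a sum of ages
p_col <- marriage %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = person,
    y = age
  ) +
  ggplot2::geom_col()

p_col
```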

- ggplot2::geom_col() would be a better choice, but we still have to think too much about what it’s showing. Yes, a bar plot might be the right choice for our final plot, but it’s sometimes troublesome when we want to explore quickly.

- Here we use ggplot2::geom_point() and facet by person
marriage %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = prev_marriage,
    y = age
  ) +
  ggplot2::facet_wrap(~person) +
  ggplot2::geom_point(alpha = 0.3)
Immediately this is interesting: we see the ages of brides and grooms, and whether this is their first marriage or how their previous one ended (death or divorce)!
Because the distribution of those points is so pleasingly intuitive, we can go on to experiment with the classic ways of representing the distributions we see. First, histograms:
marriage %>%
  ggplot2::ggplot() +
  ggplot2::aes(x = age) +
  ggplot2::facet_wrap(~ person + prev_marriage) +
  ggplot2::geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
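The message asks for an explicit bin width. A minimal sketch of the same histogram with 5-year age bins follows; the value 5 is an arbitrary illustrative choice, and prev_marriage is rebuilt here with dplyr::coalesce() only so that the example is self-contained.

```r
library(magrittr) # provides %>%

# Rebuild prev_marriage: a missing prevconc is treated as a first marriage
marriage <- mosaicData::Marriage %>%
  dplyr::mutate(
    prev_marriage = dplyr::coalesce(as.character(prevconc), "First Time")
  )

p_hist <- marriage %>%
  ggplot2::ggplot() +
  ggplot2::aes(x = age) +
  ggplot2::facet_wrap(~ person + prev_marriage) +
  ggplot2::geom_histogram(binwidth = 5) # 5-year age bands, chosen for illustration

p_hist
```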

- Or the density distribution with ggplot2::geom_density()
p <-
  marriage %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = age,
    fill = prev_marriage
  ) +
  ggplot2::geom_density(adjust = 1, alpha = 0.5, colour = NA) +
  ggplot2::facet_wrap(vars(person), ncol = 1) +
  ggplot2::theme_minimal()
directlabels::direct.label(p, list("top.points", cex = .75, hjust = 0, vjust = -0.2))
- However, with so few data points I personally prefer the geom_point chart.
3.4 Now facet by categories
Another logical step after showing categories by colour is to use “small multiples”. This is a fancy way of saying draw a chart for each category and look at them all at once in a grid. An important setting here is scales = "free", which gives each panel its own scales so we can study what’s going on in each city.
This lets us more easily spot interesting differences in the seasonal pattern between cities.
df_red %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales,
    colour = city
  ) +
  ggplot2::geom_line() +
  ggplot2::facet_wrap(~city,
    scales = "free"
  )
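One variation to consider: since every city covers the same date range, freeing only the y axis keeps the panels aligned in time while still letting each city use its own sales range. A minimal sketch, with df_red replaced by a hypothetical four-city stand-in because its definition is not shown in this section:

```r
library(magrittr) # provides %>%

# Hypothetical stand-in for df_red: a few cities picked only for illustration
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Austin", "Dallas", "Houston", "San Antonio")) %>%
  dplyr::mutate(date = as.Date(sprintf("%d-%02d-01", year, month)))

p_free_y <- df_red %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales,
    colour = city
  ) +
  ggplot2::geom_line() +
  ggplot2::facet_wrap(~city,
    scales = "free_y" # shared date axis, separate sales axis per city
  )

p_free_y
```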
3.5 Facet with trelliscopejs
Another powerful way to facet, or create small multiples, for your data exploration is trelliscopejs. Here we look at all the US cities, adding a facet by city.
It also lets you play around with the data further. Have a go with the display below and see what it does.
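The code for the display is not shown. As a rough sketch only, assuming the trelliscopejs package is installed and that df is txhousing with a Date column (both assumptions), the facet could be swapped for trelliscopejs::facet_trelliscope(). The panel layout and the display name are illustrative.

```r
library(magrittr) # provides %>%

# Assumption: df is txhousing with a proper Date column built from year and month
df <- ggplot2::txhousing %>%
  dplyr::mutate(date = as.Date(sprintf("%d-%02d-01", year, month)))

p_trellis <- df %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales
  ) +
  ggplot2::geom_line() +
  trelliscopejs::facet_trelliscope(
    ~city,
    nrow = 2, # panels per page: 2 rows by 4 columns
    ncol = 4,
    scales = "free",
    name = "sales_by_city" # hypothetical display name
  )

# Printing the object builds and opens the interactive display:
# p_trellis
```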
3.6 Or loop and plot every value
- Or, to really study each chart, nest the data into a data frame of data frames, one for each city. Then loop through each one, creating a plot and storing it in a new column of that data frame.
df_red_nest_plot <-
  df_red_nest %>%
  dplyr::mutate(plot = purrr::map2(
    .x = data,
    .y = city,
    ~ ggplot2::ggplot(
      data = .x,
      ggplot2::aes(
        x = date,
        y = sales
      )
    ) +
      ggplot2::ggtitle(glue::glue("Plot of {.y}")) +
      ggplot2::geom_line()
  ))
The output is a list of 11 plots, [[1]] to [[11]], one for each city.
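The code above assumes a nested data frame called df_red_nest already exists, but the nesting step is not shown. A minimal sketch of one way to build it with tidyr::nest(), using a hypothetical four-city stand-in for df_red:

```r
library(magrittr) # provides %>%

# Hypothetical stand-in for df_red: a few cities picked only for illustration
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Austin", "Dallas", "Houston", "San Antonio")) %>%
  dplyr::mutate(date = as.Date(sprintf("%d-%02d-01", year, month)))

# One row per city; the remaining columns are packed into a list-column "data"
df_red_nest <- df_red %>%
  dplyr::group_by(city) %>%
  tidyr::nest() %>%
  dplyr::ungroup()

df_red_nest

# Once the plot column has been added, each plot can be drawn in turn with:
# purrr::walk(df_red_nest_plot$plot, print)
```

Keeping the plots in a column next to their data means a single chart can be pulled out by filtering on city instead of re-running the plotting code.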

3.7 Polish your final plot
We now have a bare-minimum Exploratory Data Analysis toolkit: explore the data from the console using View(), then look at the data points, followed by some line plots.
We could soon be ready to decide on the plot that tells an interesting story. But adding all the bells and whistles to make it ready for a customer or a publication can take ages. It shouldn’t be part of your exploratory data analysis.
Also, we should use the code style recommended earlier, which lays out your code cleanly. It’s then far quicker to comment out or tweak the values of each part of your plot until it looks just right.
I won’t explain each line below other than to say you can run it in chunks to understand it like the popular ggplot flip-books.
# a list of dates to add vertical lines to the plot
years <- base::seq.Date(
  from = as.Date("2000-01-01"),
  to = as.Date("2015-01-01"),
  by = "years"
)
df %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales,
    colour = city
  ) +
  ggplot2::geom_line(size = 1) + # ggplot2 >= 3.4.0 prefers linewidth = 1 for lines
  ggplot2::theme_minimal() +
  gghighlight::gghighlight(base::max(sales) > 5000, # highlight only cities with higher sales
    label_params = list(size = 4)
  ) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::scale_x_date(
    date_breaks = "1 year",
    labels = scales::date_format("%b %Y"),
    limits = c(
      as.Date("2000-01-01"),
      as.Date("2015-07-01")
    )
  ) +
  ggplot2::labs(
    title = "US Housing Sales",
    subtitle = "US cities with more than 5,000 sales in any month",
    caption = "Source: ggplot2 built in txhousing data set",
    x = "Month",
    y = "Volume of Sales"
  ) +
  ggplot2::geom_vline(
    xintercept = years,
    linetype = 4
  ) +
  ggplot2::theme(
    panel.grid.major.x = ggplot2::element_blank(),
    panel.grid.minor.x = ggplot2::element_blank(),
    strip.text.x = ggplot2::element_text(size = 10),
    axis.text.x = ggplot2::element_text(
      angle = 60,
      hjust = 1,
      size = 9
    ),
    legend.text = ggplot2::element_text(size = 12),
    legend.position = "right",
    legend.direction = "vertical",
    plot.title = ggplot2::element_text(
      size = 22,
      face = "bold"
    ),
    plot.subtitle = ggplot2::element_text(
      color = "grey",
      size = 18
    ),
    plot.caption = ggplot2::element_text(
      hjust = 0,
      size = 12,
      color = "darkgrey"
    ),
    legend.title = ggplot2::element_blank()
  )
## label_key: city
## Warning: Removed 430 rows containing missing values (geom_path).

So this isn’t necessarily a good plot. There are things wrong with it that I expect you’ll want to change. But with this clear ladder of code you can more quickly read, edit, comment chunks out, or run in chunks from the top down.